Exploring Data Clustering

Approaches to explore: DBSCAN, HDBSCAN, and PCA.

Notes on DBSCAN:

There are two parameters to the algorithm, min_samples and eps, which define formally what we mean when we say dense. Higher min_samples or lower eps indicate higher density necessary to form a cluster.

More formally, we define a core sample as a sample in the dataset such that there exist min_samples other samples within a distance of eps; these are defined as neighbors of the core sample. This tells us that the core sample is in a dense area of the vector space. A cluster is a set of core samples that can be built recursively: take a core sample, find all of its neighbors that are core samples, find all of their neighbors that are core samples, and so on. A cluster also has a set of non-core samples, which are samples that are neighbors of a core sample in the cluster but are not themselves core samples. Intuitively, these samples are on the fringes of the cluster.

Any core sample is, by definition, part of a cluster, and any cluster has at least min_samples points in it, following the definition of a core sample. Any sample that is not a core sample and is at a distance greater than eps from every core sample is considered an outlier by the algorithm.
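
As a toy sketch of how the two parameters interact (separate from the SAS data): a dense 2-D blob plus one isolated point; the isolated point should come out as noise, which scikit-learn labels -1.

In [ ]:
# Toy DBSCAN sketch (not part of the SAS analysis): a dense 2-D blob plus
# one isolated point that should be labeled as noise (-1).
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.RandomState(0)
blob = rng.normal(0, 0.1, size=(20, 2))  # one dense blob
lone = np.array([[5.0, 5.0]])            # an isolated point
toy = np.vstack([blob, lone])

strict = DBSCAN(eps=0.02, min_samples=5).fit(toy)  # demands very high density
loose = DBSCAN(eps=0.5, min_samples=5).fit(toy)    # demands lower density
print(strict.labels_)  # mostly -1: few points have 5 neighbors within 0.02
print(loose.labels_)   # the blob forms one cluster; the isolated point stays -1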

Consider instead using HDBSCAN (explored below).


In [6]:
import glob
import os
import numpy as np
import pandas as pd
np.set_printoptions(suppress=True)

In [7]:
from bokeh.io import output_notebook, show
from bokeh.resources import INLINE
output_notebook(resources=INLINE)


Loading BokehJS ...

In [8]:
import bokeh.plotting
import bokeh.models   # Legend, Range1d
import bokeh.layouts # gridplot
from bokeh.palettes import Colorblind7 as palette

In [9]:
import datashader
import collections
import xarray
from datashader.bokeh_ext import InteractiveImage

Get the data

Load the SAS data

Begin by designating where the theoretical SAXS and SANS data files are located.


In [10]:
import sys
sys.platform


Out[10]:
'linux'

In [11]:
if sys.platform == 'linux':
    sas_dir = '/home/schowell/data/scratch/sas_clustering/sascalc'
    pr_dir = '/home/schowell/data/scratch/sas_clustering/pr'

elif sys.platform == 'darwin':
    sas_dir = '/Users/schowell/scratch/sas_clustering/sascalc'
    pr_dir = '/Users/schowell/scratch/sas_clustering/pr'

saxs_dir = 'xray'
sans_dir = 'neutron_D2Op_100'
sas_ext = '*.iq'
saxs_search = os.path.join(sas_dir, saxs_dir, sas_ext)
sans_search = os.path.join(sas_dir, sans_dir, sas_ext)
print(saxs_search)
print(sans_search)


pr_ext = '*.pr'
pr_search = os.path.join(pr_dir, pr_ext)
print(pr_search)


/home/schowell/data/scratch/sas_clustering/sascalc/xray/*.iq
/home/schowell/data/scratch/sas_clustering/sascalc/neutron_D2Op_100/*.iq
/home/schowell/data/scratch/sas_clustering/pr/*.pr

In [12]:
saxs_files = glob.glob(saxs_search)
sans_files = glob.glob(sans_search)
pr_files = glob.glob(pr_search)
saxs_files.sort()
sans_files.sort()
pr_files.sort()
n_saxs = len(saxs_files)
n_sans = len(sans_files)
n_pr = len(pr_files)
print('Found {} SAXS data files'.format(n_saxs))
print('Found {} SANS data files'.format(n_sans))
print('Found {} P(r) data files'.format(n_pr))


Found 200 SAXS data files
Found 200 SANS data files
Found 200 P(r) data files

In [13]:
saxs_data = []

# load in the first data set to setup the q-mask
first_data = np.loadtxt(saxs_files[0])
q_mask = first_data[:, 0] <= 0.18  # only use data up to 0.18 1/A
first_data[:, 1] /= first_data[0, 1]  # normalize I(0) to 1
first_data = first_data[q_mask]
saxs_data.append(first_data[1:, 1])  # skip I(0); it is the same for every data set

# load in the rest of the data
for saxs_file in saxs_files[1:]:
    x_data = np.loadtxt(saxs_file)
    x_data[:, 1] /= x_data[0, 1]
    x_data = x_data[q_mask]
    assert np.allclose(x_data[0, 1], first_data[0, 1]), 'ERROR: data not normalized to I(0)'
    assert np.allclose(x_data[:, 0], first_data[:, 0]), 'ERROR: data not on same Q-grid'
    saxs_data.append(x_data[1:, 1])

saxs_data = np.array(saxs_data)

In [14]:
sans_data = []

# load in the first data set to setup the q-mask
first_data = np.loadtxt(sans_files[0])
q_mask = first_data[:, 0] <= 0.18  # only use data up to 0.18 1/A
first_data[:, 1] /= first_data[0, 1]  # normalize I(0) to 1
first_data = first_data[q_mask]
sans_data.append(first_data[1:, 1])  # skip I(0); it is the same for every data set

# load in the rest of the data
for sans_file in sans_files[1:]:
    n_data = np.loadtxt(sans_file)
    n_data[:, 1] /= n_data[0, 1]
    n_data = n_data[q_mask]
    assert np.allclose(n_data[0, 1], first_data[0, 1]), 'ERROR: data not normalized'
    assert np.allclose(n_data[:, 0], first_data[:, 0]), 'ERROR: data not on same Q-grid'
    sans_data.append(n_data[1:, 1])
    
sans_data = np.array(sans_data)

Store the Q-values


In [15]:
q_saxs = x_data[1:, 0]
q_sans = n_data[1:, 0]
print(q_saxs)    
print(q_sans)


[ 0.01  0.02  0.03  0.04  0.05  0.06  0.07  0.08  0.09  0.1   0.11  0.12
  0.13  0.14  0.15  0.16  0.17  0.18]
[ 0.01  0.02  0.03  0.04  0.05  0.06  0.07  0.08  0.09  0.1   0.11  0.12
  0.13  0.14  0.15  0.16  0.17  0.18]

In [16]:
np.allclose(q_saxs, q_sans)


Out[16]:
True

Load the P(r) data


In [17]:
pr_data_l = []
n_pr = len(pr_files)
n_r = np.empty(n_pr, dtype=int)

# load in all the data sets
for i, pr_file in enumerate(pr_files):
    n_data = np.loadtxt(pr_file, delimiter=',', dtype=int)
    pr_data_l.append(n_data[:, 1])
    n_r[i] = len(n_data)
    
r_max = n_r.max()
pr_data = np.zeros([n_pr, r_max], dtype=int)
for i, n_data in enumerate(pr_data_l):
    pr_data[i, :len(n_data)] = n_data
    

In [18]:
pr_data[:3, :8]


Out[18]:
array([[     0,  17067,  50223,  97762, 154904, 252311, 345445, 436965],
       [     0,  17068,  50248,  97863, 155068, 252631, 345934, 437560],
       [     0,  17068,  50252,  97875, 155072, 252639, 345932, 437629]])

Visualize the data

Create a dataframe for plotting purposes (each column will be a different I(Q) or P(r) curve)


In [19]:
saxs_df = pd.DataFrame(data=saxs_data.T)
saxs_cols = ['c{}'.format(c) for c in saxs_df.columns]
saxs_df.columns = saxs_cols
saxs_df['q'] = q_saxs
saxs_df.head()


Out[19]:
c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 ... c191 c192 c193 c194 c195 c196 c197 c198 c199 q
0 0.917798 0.917891 0.917723 0.917790 0.916672 0.916286 0.917395 0.917101 0.917916 0.920588 ... 0.929958 0.929471 0.928210 0.929580 0.929479 0.920882 0.918059 0.915479 0.915412 0.01
1 0.711798 0.712118 0.711597 0.711866 0.707277 0.706311 0.709269 0.708664 0.710933 0.718504 ... 0.749546 0.748034 0.744269 0.748370 0.748000 0.719319 0.710378 0.702664 0.702479 0.02
2 0.474336 0.474924 0.474168 0.474622 0.464832 0.464067 0.467067 0.467008 0.469496 0.478496 ... 0.528697 0.526529 0.521513 0.526983 0.526311 0.479227 0.465958 0.455723 0.455479 0.03
3 0.287303 0.288050 0.287303 0.287555 0.273513 0.273714 0.274563 0.275706 0.276546 0.281588 ... 0.335605 0.333647 0.329756 0.334017 0.333134 0.281017 0.268790 0.261378 0.261134 0.04
4 0.179109 0.179857 0.179261 0.178765 0.164588 0.165555 0.164193 0.166034 0.164571 0.163933 ... 0.203950 0.202840 0.201395 0.203008 0.202101 0.161176 0.154756 0.153420 0.153134 0.05

5 rows × 201 columns


In [20]:
sans_df = pd.DataFrame(data=sans_data.T)
sans_cols = ['c{}'.format(c) for c in sans_df.columns]
sans_df.columns = sans_cols
sans_df['q'] = q_sans
sans_df.head()


Out[20]:
c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 ... c191 c192 c193 c194 c195 c196 c197 c198 c199 q
0 0.919420 0.919513 0.919345 0.919370 0.918168 0.917815 0.918866 0.918630 0.919370 0.921933 ... 0.931681 0.931202 0.929950 0.931303 0.931202 0.922336 0.919496 0.916933 0.916849 0.01
1 0.716874 0.717202 0.716681 0.716840 0.711958 0.711076 0.713891 0.713445 0.715496 0.722782 ... 0.755151 0.753639 0.749891 0.753966 0.753597 0.723899 0.714882 0.707126 0.706891 0.02
2 0.481966 0.482563 0.481807 0.482134 0.471891 0.471176 0.474059 0.474151 0.476361 0.485084 ... 0.537605 0.535420 0.530387 0.535857 0.535176 0.486118 0.472580 0.462151 0.461815 0.03
3 0.295185 0.295966 0.295202 0.295420 0.280815 0.281000 0.281840 0.283017 0.283639 0.288622 ... 0.345370 0.343370 0.339437 0.343731 0.342824 0.287950 0.275244 0.267462 0.267109 0.04
4 0.185849 0.186655 0.186042 0.185639 0.170824 0.171731 0.170420 0.172176 0.170622 0.170092 ... 0.212361 0.211193 0.209739 0.211353 0.210403 0.166605 0.159647 0.157966 0.157597 0.05

5 rows × 201 columns


In [21]:
pr_df = pd.DataFrame(data=pr_data.T)
pr_cols = ['c{}'.format(c) for c in pr_df.columns]
pr_df.columns = pr_cols
r = np.arange(r_max)
pr_df['r'] = r
pr_df.head()


Out[21]:
c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 ... c191 c192 c193 c194 c195 c196 c197 c198 c199 r
0 0 0 0 0 0 0 1 0 0 0 ... 2 1 1 1 2 0 0 0 0 0
1 17067 17068 17068 17067 17070 17066 17073 17069 17076 17079 ... 17101 17100 17099 17098 17091 17066 17066 17066 17066 1
2 50223 50248 50252 50265 50216 50207 50226 50219 50231 50226 ... 50324 50310 50304 50312 50303 50202 50198 50199 50200 2
3 97762 97863 97875 97883 97600 97613 97605 97608 97604 97633 ... 97901 97878 97836 97866 97828 97547 97542 97555 97547 3
4 154904 155068 155072 155081 154428 154434 154439 154443 154385 154410 ... 154899 154866 154798 154872 154806 154143 154154 154178 154185 4

5 rows × 201 columns


In [22]:
q_range = q_saxs.min(), q_saxs.max()
pq_range = saxs_data.min(), saxs_data.max()
r_range = 0, r_max
pr_range = pr_data.min(), pr_data.max()

In [28]:
%%time
sans_canvas = datashader.Canvas(x_range=q_range, y_range=pq_range, plot_height=300, plot_width=300)
# create an ordered-dictionary of each line, aggregated
sans_aggs = collections.OrderedDict((c, sans_canvas.line(sans_df, 'q', c)) for c in sans_cols)
sans_merged = xarray.concat(sans_aggs.values(), dim=pd.Index(sans_cols, name='cols'))
sans_img = datashader.transfer_functions.shade(sans_merged.sum(dim='cols'), how='eq_hist')


CPU times: user 1min 31s, sys: 892 ms, total: 1min 32s
Wall time: 1min 32s

In [29]:
%%time
saxs_canvas = datashader.Canvas(x_range=q_range, y_range=pq_range, plot_height=300, plot_width=300)
# create an ordered-dictionary of each line, aggregated
saxs_aggs = collections.OrderedDict((c, saxs_canvas.line(saxs_df, 'q', c)) for c in saxs_cols)
saxs_merged = xarray.concat(saxs_aggs.values(), dim=pd.Index(saxs_cols, name='cols'))
saxs_img = datashader.transfer_functions.shade(saxs_merged.sum(dim='cols'), how='eq_hist')


CPU times: user 880 ms, sys: 24 ms, total: 904 ms
Wall time: 900 ms

In [30]:
%%time
pr_canvas = datashader.Canvas(x_range=r_range, y_range=pr_range, plot_height=300, plot_width=300)
# create an ordered-dictionary of each line, aggregated
pr_aggs = collections.OrderedDict((c, pr_canvas.line(pr_df, 'r', c)) for c in pr_cols)
pr_merged = xarray.concat(pr_aggs.values(), dim=pd.Index(pr_cols, name='cols'))
pr_img = datashader.transfer_functions.shade(pr_merged.sum(dim='cols'), how='eq_hist')


CPU times: user 2min 13s, sys: 516 ms, total: 2min 14s
Wall time: 2min 14s

In [172]:
def saxs_image(q_range, pq_range, w, h):
    saxs_canvas = datashader.Canvas(x_range=q_range, y_range=pq_range, 
                                    plot_height=h, plot_width=w)
    saxs_aggs = collections.OrderedDict((c, saxs_canvas.line(saxs_df, 'q', c)) for c in saxs_cols)
    saxs_merged = xarray.concat(saxs_aggs.values(), dim=pd.Index(saxs_cols, name='cols'))
    saxs_img = datashader.transfer_functions.shade(saxs_merged.sum(dim='cols'), how='eq_hist')
    return saxs_img

p = bokeh.plotting.figure(x_range=q_range, y_range=pq_range,
                          plot_width=400, plot_height=300, 
                          x_axis_label='Q (1/A)', y_axis_label='P(Q)')
InteractiveImage(p, saxs_image)


Out[172]:

In [173]:
def sans_image(q_range, pq_range, w, h):
    sans_canvas = datashader.Canvas(x_range=q_range, y_range=pq_range, 
                                    plot_height=h, plot_width=w)
    sans_aggs = collections.OrderedDict((c, sans_canvas.line(sans_df, 'q', c)) for c in sans_cols)
    sans_merged = xarray.concat(sans_aggs.values(), dim=pd.Index(sans_cols, name='cols'))
    sans_img = datashader.transfer_functions.shade(sans_merged.sum(dim='cols'), how='eq_hist')
    return sans_img

p = bokeh.plotting.figure(x_range=q_range, y_range=pq_range,
                          plot_width=400, plot_height=300, 
                          x_axis_label='Q (1/A)', y_axis_label='P(Q)')
InteractiveImage(p, sans_image)


Out[173]:

In [33]:
def pr_image(r_range, pr_range, w, h):
    pr_canvas = datashader.Canvas(x_range=r_range, y_range=pr_range, 
                                  plot_height=h, plot_width=w)
    pr_aggs = collections.OrderedDict((c, pr_canvas.line(pr_df, 'r', c)) for c in pr_cols)
    pr_merged = xarray.concat(pr_aggs.values(), dim=pd.Index(pr_cols, name='cols'))
    pr_img = datashader.transfer_functions.shade(pr_merged.sum(dim='cols'), how='eq_hist')
    return pr_img

p = bokeh.plotting.figure(x_range=r_range, y_range=pr_range,
                          plot_width=400, plot_height=300, 
                          x_axis_label='r (A)', y_axis_label='P(r)')
InteractiveImage(p, pr_image)


Out[33]:

In [34]:
sans_img


Out[34]:

In [35]:
saxs_img


Out[35]:

In [36]:
pr_img


Out[36]:

Manipulate the data


In [23]:
n_samples, n_features = saxs_data.shape # for PCA, should be (n_samples, n_features)
print('# samples: {}\n# features: {}'.format(n_samples, n_features))


# samples: 200
# features: 18

Each row in the NumPy array is a separate scattering pattern or pair-distance distribution curve. Each column will be a feature we explore.


In [24]:
print(saxs_data[:3, :5])
print(sans_data[:3, :5])
print(pr_data[:3, :5])


[[ 0.91779832  0.71179832  0.47433613  0.28730252  0.17910924]
 [ 0.91789076  0.71211765  0.47492437  0.28805042  0.17985714]
 [ 0.91772269  0.71159664  0.47416807  0.28730252  0.1792605 ]]
[[ 0.91942017  0.71687395  0.48196639  0.29518487  0.18584874]
 [ 0.91951261  0.71720168  0.48256303  0.29596639  0.18665546]
 [ 0.91934454  0.71668067  0.48180672  0.29520168  0.18604202]]
[[     0  17067  50223  97762 154904]
 [     0  17068  50248  97863 155068]
 [     0  17068  50252  97875 155072]]

In [25]:
min_vals = saxs_data.min(axis=0)
max_vals = saxs_data.max(axis=0)
saxs_range = max_vals - min_vals
print(saxs_range)


[ 0.02576471  0.07929412  0.11215126  0.0959916   0.06180672  0.03790756
  0.0335042   0.02414286  0.01343697  0.00860504  0.00817647  0.00815966
  0.00719328  0.00639496  0.00536134  0.00468067  0.00431933  0.00436975]

In [26]:
min_vals = sans_data.min(axis=0)
max_vals = sans_data.max(axis=0)
sans_range = max_vals - min_vals
print(sans_range)


[ 0.02515126  0.07772269  0.1107395   0.0959916   0.0642437   0.03981513
  0.03482353  0.02517647  0.01416807  0.00884874  0.00886555  0.00848739
  0.00810084  0.00784034  0.0067395   0.00586555  0.00514286  0.00516807]

Notice how the variation depends significantly on Q.


In [27]:
min_vals = pr_data.min(axis=0)
max_vals = pr_data.max(axis=0)
pr_range = max_vals - min_vals
print(pr_range)


[      2      48     247     684    1449    2601    4696    7199   10348
   14852   20324   26448   33483   42035   51892   63056   76965   91133
  107671  126696  147179  169099  191703  215318  241866  270704  301275
  332337  363255  393855  423570  454518  480211  505771  528938  554157
  581678  610204  632516  657517  678458  701593  720512  736965  751249
  761955  773427  780324  794875  814249  835276  852393  865353  875525
  887152  893570  895052  892918  891731  887365  889148  882761  882142
  874728  862332  841209  832268  814598  794713  775237  763091  755711
  737816  734890  748581  763052  764722  762752  744522  743587  746469
  744480  734704  744036  753297  758225  759160  765943  794522  814081
  836536  857317  874178  886792  905220  920659  930358  923466  914218
  904550  939833  993162 1029340 1072526 1101818 1130918 1143524 1155555
 1158588 1161896 1156140 1148066 1130239 1109465 1080455 1048508 1016441
  980527  943196  915941  895630  874439  848686  824559  796117  768513
  737693  706352  672474  638126  604399  569085  533381  499646  466126
  433257  401017  371104  340401  310902  283300  256205  230039  204547
  179687  156655  134247  113765   95394   78563   63384   50582   40122
   31690   24280   18434   13555    9796    6520    4508    3062    1961
    1206     668     449     233      97      49      20       3]

In [179]:
range_pq = bokeh.plotting.figure(x_axis_label='Q (1/A)', y_axis_label='P(Q) range',
                                 width=400, height=300, title='Data Range')
range_pr = bokeh.plotting.figure(x_axis_label='r (A)', y_axis_label='P(r) range',
                                 width=400, height=300, title='Data Range')

range_pq.line(q_saxs, saxs_range, legend='SAXS range', color=palette[0])
range_pq.line(q_sans, sans_range, legend='SANS range', color=palette[1])

range_pr.line(r, pr_range, legend='P(r) range', color=palette[3])
range_pr.legend.location = "bottom_center"

plots = bokeh.layouts.gridplot([[range_pq, range_pr]])
show(plots)


Rescale the data

We originally used StandardScaler but changed to RobustScaler to avoid complications from outliers, which skew the mean and standard deviation; RobustScaler instead centers on the median and scales by the interquartile range.
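
As a minimal sketch of why: on a toy feature with a single outlier, StandardScaler's mean and standard deviation shift noticeably, while RobustScaler's median and interquartile range barely move.

In [ ]:
# Toy comparison of the two scalers on a feature with one large outlier.
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

rng = np.random.RandomState(0)
feature = np.append(rng.normal(0, 1, 99), 50.0).reshape(-1, 1)  # one outlier

std = StandardScaler().fit(feature)
rob = RobustScaler().fit(feature)
print('StandardScaler center, scale:', std.mean_, std.scale_)    # pulled by the outlier
print('RobustScaler center, scale:  ', rob.center_, rob.scale_)  # nearly unaffected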


In [180]:
from sklearn.preprocessing import StandardScaler, RobustScaler
saxs_scaler = RobustScaler()
sans_scaler = RobustScaler()
pr_scaler = RobustScaler()

In [181]:
saxs_scaler.fit(saxs_data)
sans_scaler.fit(sans_data)
pr_scaler.fit(pr_data)


Out[181]:
RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True,
       with_scaling=True)

In [3]:
import sklearn.preprocessing

In [4]:
scaler = sklearn.preprocessing.RobustScaler

In [182]:
scaled_saxs = saxs_scaler.transform(saxs_data)
scaled_sans = sans_scaler.transform(sans_data)
scaled_pr = pr_scaler.transform(pr_data)
print(scaled_saxs[:3, :5])
print(scaled_sans[:3, :5])
print(scaled_pr[:3, :5])


[[-0.49091402 -0.40799064 -0.32635368 -0.05543155  0.90684171]
 [-0.4789802  -0.39431297 -0.30578209 -0.02232143  0.95177985]
 [-0.50067806 -0.41662917 -0.33223128 -0.05543155  0.91593032]]
[[-0.48020362 -0.40078512 -0.2940716  -0.03060129  0.90528101]
 [-0.46776018 -0.38620432 -0.27385951  0.00268432  0.95179264]
 [-0.49038462 -0.40938405 -0.29948046 -0.02988547  0.91642442]]
[[ 0.         -0.5         1.08235294  4.81218274  5.23509934]
 [ 0.         -0.3         2.25882353  6.86294416  6.32119205]
 [ 0.         -0.3         2.44705882  7.10659898  6.34768212]]

In [183]:
min_vals = scaled_saxs.min(axis=0)
max_vals = scaled_saxs.max(axis=0)
saxs_scaled_range = max_vals - min_vals

min_vals = scaled_sans.min(axis=0)
max_vals = scaled_sans.max(axis=0)
sans_scaled_range = max_vals - min_vals

min_vals = scaled_pr.min(axis=0)
max_vals = scaled_pr.max(axis=0)
pr_scaled_range = max_vals - min_vals

print(saxs_scaled_range)
print(sans_scaled_range)
print(pr_scaled_range)


[ 3.32628153  3.39638261  3.92212181  4.24962798  3.71370866  2.55943262
  2.84531668  3.72512156  3.16947473  5.27155727  6.76869565  5.46272855
  4.29611041  4.51632047  2.93333333  3.23367199  3.85018727  4.3153527 ]
[ 3.38574661  3.45789326  3.75147676  4.08840372  3.70397287  2.6151511
  2.778411    3.70562771  3.14405594  4.47133758  7.08053691  4.73622509
  3.96707819  5.22689076  3.37684211  3.18721461  3.80715397  4.3772242 ]
[   2.            9.6          11.62352941   13.88832487    9.59602649
    7.10655738    7.0696274     5.72143851    4.9036844     4.75720692
    4.67809875    4.41664927    4.06692579    3.93650645    3.85498849
    3.83266726    3.77071468    3.64623156    3.53147037    3.50613662
    3.43829976    3.37267143    3.39069299    3.4764736     3.57107158
    3.69373936    3.77897496    3.88733482    4.04068999    4.15430377
    4.24942564    4.2966205     4.29729591    4.21608464    4.10279919
    4.10102386    4.12418397    4.36057783    4.66456367    5.00752633
    4.87748382    4.91452728    4.83880922    4.59214003    4.46796369
    4.25153025    4.19059668    4.08603823    3.89385972    3.80759298
    3.7769744     3.90241591    3.77046987    3.54606769    3.36517564
    3.18799965    3.12309332    3.06850209    3.01848108    3.00673781
    3.08340096    3.14077171    3.05241031    3.04496285    3.05053466
    3.05540033    3.15253627    3.30358169    3.40978662    3.34119675
    3.33836507    3.3060321     3.19216038    3.32404572    3.14076862
    3.12045229    3.13010733    3.08045717    3.09392132    3.08186183
    2.96431064    2.70891647    2.49534355    2.58952754    2.67475405
    2.65975499    2.7164664     2.67946914    2.66936564    2.75566162
    2.80011113    2.88250325    2.85116539    2.81957409    2.90404077
    3.0017272     3.07760605    3.04786328    2.95533944    3.02555693
    3.31146614    3.5200224     3.65808716    3.78568428    3.97231896
    3.86431545    3.59374698    3.57668933    3.70474602    3.42744205
    3.20880594    3.05728417    3.04380605    2.95124245    2.76822899
    2.60975446    2.55453296    2.43585742    2.42520766    2.37195936
    2.36747657    2.36696467    2.4078521     2.46859355    2.59741145
    2.75227032    2.89919816    2.93103974    2.94866171    3.03083311
    3.13202702    3.23462292    3.3504463     3.54506576    3.64991436
    3.83020086    3.88503308    3.99165322    4.08338282    4.35308943
    4.68496775    5.13133518    5.63327477    6.18018113    6.81917629
    7.61922132    8.56604135    9.64416658   11.4956768    13.87487306
   15.24111812   14.96066253   14.33440514   13.61692985   12.68050659
   11.9895935    11.40993266   11.07518372   10.96258932   11.58868895
   13.7309417    17.50892857   20.61538462   19.94029851   54.42424242
  233.           97.           49.           20.            3.        ]

Notice how rescaling the data using the RobustScaler has removed the $Q$ or $r$ dependence from the ranges.


In [184]:
range_pq0 = bokeh.plotting.figure(x_axis_label='Q (1/A)', y_axis_label='P(Q) range',
                                 width=400, height=300, title='Range Before')
range_pr0 = bokeh.plotting.figure(x_axis_label='r (A)', y_axis_label='P(r) range',
                                 width=400, height=300, title='Comparison')
range_pq1 = bokeh.plotting.figure(x_axis_label='Q (1/A)', y_axis_label='P(Q) range',
                                 width=400, height=300, title='Comparison')
range_pr1 = bokeh.plotting.figure(x_axis_label='r (A)', y_axis_label='P(r) range',
                                 width=400, height=300, title='Range After')

range_pq0.line(q_saxs, saxs_range, legend='SAXS before', color=palette[0])
range_pq0.line(q_sans, sans_range, legend='SANS before', color=palette[1])
range_pq1.line(q_saxs, saxs_range, legend='SAXS before', color=palette[0])
range_pq1.line(q_sans, sans_range, legend='SANS before', color=palette[1])

range_pq1.line(q_saxs, saxs_scaled_range, legend='SAXS after', color=palette[0], line_dash= [4, 2])
range_pq1.line(q_sans, sans_scaled_range, legend='SANS after', color=palette[1], line_dash= [4, 2])


range_pr0.line(r, pr_range, legend='before', color=palette[3])
range_pr0.line(r, pr_scaled_range, legend='after', color=palette[3], line_dash= [4, 2])
range_pr1.line(r, pr_scaled_range, legend='after', color=palette[3], line_dash= [4, 2])

range_pr0.legend.location = "bottom_center"
range_pr1.legend.location = "top_left"

plots = bokeh.layouts.gridplot([[range_pq0, range_pr0],
                                [range_pq1, range_pr1]])
show(plots)


Note that for P(r), the last 5 points in almost every data set are zero, so the range did not change at all, as shown below.


In [48]:
pr_data[:, -5:]


Out[48]:
array([[  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [ 33,  10,   1,   0,   0],
       [ 12,   1,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  4,   0,   0,   0,   0],
       [  4,   0,   0,   0,   0],
       [  3,   0,   0,   0,   0],
       [ 21,   2,   0,   0,   0],
       [ 15,   1,   0,   0,   0],
       [  9,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [233,  97,  49,  20,   3],
       [ 62,  37,  17,   0,   0],
       [123,  41,  32,   7,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  7,   0,   0,   0,   0],
       [ 25,   2,   0,   0,   0],
       [140,  51,  42,  18,   1],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  3,   0,   0,   0,   0],
       [  3,   0,   0,   0,   0],
       [  1,   0,   0,   0,   0],
       [  1,   0,   0,   0,   0],
       [ 10,   2,   0,   0,   0],
       [  4,   0,   0,   0,   0],
       [ 19,   5,   1,   0,   0],
       [ 10,   1,   0,   0,   0],
       [ 10,   1,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  4,   0,   0,   0,   0],
       [  5,   0,   0,   0,   0],
       [ 13,   2,   0,   0,   0],
       [ 11,   0,   0,   0,   0],
       [ 11,   2,   0,   0,   0],
       [ 11,   2,   0,   0,   0],
       [ 11,   2,   0,   0,   0],
       [ 11,   2,   0,   0,   0],
       [ 13,   2,   0,   0,   0],
       [ 13,   2,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  2,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  2,   0,   0,   0,   0],
       [  5,   0,   0,   0,   0],
       [  4,   0,   0,   0,   0],
       [  4,   0,   0,   0,   0],
       [  4,   0,   0,   0,   0],
       [  4,   0,   0,   0,   0],
       [  4,   0,   0,   0,   0],
       [  6,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  1,   0,   0,   0,   0],
       [ 14,   3,   0,   0,   0],
       [ 14,   3,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  5,   0,   0,   0,   0]])
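
One option (only a sketch, not applied to the analysis here) would be to drop near-constant columns such as these before scaling, since they carry essentially no information for clustering and leave RobustScaler with a degenerate interquartile range.

In [ ]:
# Sketch: drop P(r) columns whose interquartile range is zero before scaling.
q75, q25 = np.percentile(pr_data, [75, 25], axis=0)
keep = (q75 - q25) > 0  # columns with a non-degenerate IQR
pr_trimmed = pr_data[:, keep]
print('kept {} of {} columns'.format(keep.sum(), pr_data.shape[1]))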

In [49]:
import matplotlib.pyplot as plt
%matplotlib inline

In [50]:
def compare_q1_q2(q1, q2):
    plt.figure()
    plt.plot(saxs_data[:,q1], saxs_data[:,q2], 'bo', markersize=5)
    plt.title('original data')
    plt.xlabel(r'$Q_{}$'.format(q1))
    plt.ylabel(r'$Q_{}$'.format(q2))
    plt.figure()
    plt.plot(scaled_saxs[:,q1], scaled_saxs[:,q2], 'bo', markersize=5)
    plt.xlabel(r'$Q_{}$'.format(q1))
    plt.ylabel(r'$Q_{}$'.format(q2))
    plt.title('scaled data')

In [51]:
compare_q1_q2(0, 8)



In [52]:
plt.figure()
plt.plot(saxs_data[:,0], saxs_data[:,17], 'bo', markersize=5)
plt.title('original data')
plt.xlabel(r'$Q_{}$'.format(0))
plt.ylabel(r'$Q_{}$'.format(17))
plt.figure()
plt.plot(scaled_saxs[:,0], scaled_saxs[:,17], 'bo', markersize=5)
plt.xlabel(r'$Q_{}$'.format(0))
plt.ylabel(r'$Q_{}$'.format(17))
plt.title('scaled data')


Out[52]:
<matplotlib.text.Text at 0x7f68899b00f0>

In [53]:
scaled_saxs.shape


Out[53]:
(200, 18)

In [54]:
i0 = 2
i_compare = 0
for i0 in range(18):
    plt.figure()
    plt.plot(scaled_saxs[:,i0], scaled_saxs[:, i_compare], 'bo')
    plt.plot(scaled_saxs[112,i0], scaled_saxs[112, i_compare], 'rs')
    plt.plot(scaled_saxs[113,i0], scaled_saxs[113, i_compare], 'gs')
    plt.xlabel(r'$Q_{}$'.format(i0))
    plt.ylabel(r'$Q_{}$'.format(i_compare))


DBSCAN


In [185]:
from sklearn.cluster import DBSCAN
from sklearn import metrics

We need to determine what distance, eps, to use as the neighborhood radius that defines core points. To do this we will use the $k$-dist plot, the sorted distance to the $k^{th}$ nearest neighbor (a scikit-learn cross-check is sketched below):

  1. Plot the sorted $k$-dist
  2. Find the point of greatest curvature, i.e. the largest second derivative
  3. Use that distance as eps
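
As a cross-check of the pairwise-distance computation below, the same sorted $k$-dist curve can be obtained directly from scikit-learn's NearestNeighbors; this is a sketch, taking $k$ equal to min_samples.

In [ ]:
# Sketch: sorted distance to the k-th nearest neighbor via NearestNeighbors.
from sklearn.neighbors import NearestNeighbors

k = 2  # assumed equal to min_samples
nn = NearestNeighbors(n_neighbors=k + 1).fit(scaled_saxs)  # +1: each point returns itself
nn_dist, _ = nn.kneighbors(scaled_saxs)
k_dist_nn = np.sort(nn_dist[:, k])  # column 0 is each point's zero distance to itself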

In [32]:
from scipy.spatial.distance import pdist, cdist

In [131]:
scaled_saxs


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-131-97364101979b> in <module>()
----> 1 scaled_saxs

NameError: name 'scaled_saxs' is not defined

In [132]:
dist = np.zeros([n_samples, n_samples])
i = np.arange(n_samples)
dist[i[:, None] < i] = pdist(scaled_saxs)  # get the distances, consider other distance metrics

i_upper = np.triu_indices(n_samples, 1)
i_lower = np.tril_indices(n_samples, -1)
dist[i_lower] = dist[i_upper]  # make the matrix symmetric

np.fill_diagonal(dist, np.inf)  # fill the diagonal with large value for simplicity


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-132-0fd30e5a2135> in <module>()
      1 dist = np.zeros([n_samples, n_samples])
      2 i = np.arange(n_samples)
----> 3 dist[i[:, None] < i] = pdist(scaled_saxs)  # get the distances, consider other distance metrics
      4 
      5 i_upper = np.triu_indices(n_samples, 1)

NameError: name 'scaled_saxs' is not defined

In [133]:
k_dist = dist.min(axis=1)
k_dist.sort()

In [134]:
k_dist[:10]


Out[134]:
array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])

In [135]:
raw_dist = np.zeros([n_samples, n_samples])
i = np.arange(n_samples)
raw_dist[i[:, None] < i] = pdist(saxs_data)  # get the distances, consider other distance metrics

i_upper = np.triu_indices(n_samples, 1)
i_lower = np.tril_indices(n_samples, -1)
raw_dist[i_lower] = raw_dist.T[i_lower]  # make the matrix symmetric
assert np.allclose(raw_dist.T, raw_dist)

np.fill_diagonal(raw_dist, np.inf)  # fill the diagonal with large value for simplicity

In [136]:
raw_k_dist = raw_dist.min(axis=0)
raw_k_dist.sort()

In [137]:
raw_k_dist.shape


Out[137]:
(200,)

In [138]:
raw_k_dist[:10]


Out[138]:
array([ 0.00001456,  0.00001456,  0.00009051,  0.00009051,  0.00015495,
        0.00015495,  0.00038965,  0.00038965,  0.00064853,  0.00064853])

In [139]:
raw_dist[:, 4].min()


Out[139]:
0.0039112840329649104

In [140]:
raw_dist[:5, :5]


Out[140]:
array([[        inf,  0.0014716 ,  0.00104904,  0.00283074,  0.02590177],
       [ 0.0014716 ,         inf,  0.00166125,  0.00340055,  0.02734865],
       [ 0.00104904,  0.00166125,         inf,  0.00257087,  0.02609299],
       [ 0.00283074,  0.00340055,  0.00257087,         inf,  0.02566253],
       [ 0.02590177,  0.02734865,  0.02609299,  0.02566253,         inf]])

In [141]:
raw_dist[4, :].min()


Out[141]:
0.0039112840329649104

In [142]:
raw_dist0 = np.zeros(n_samples)
ind = np.arange(n_samples)
for i in ind:
    dist = cdist(saxs_data[i].reshape([1, -1]), saxs_data[ind!=i])
    raw_dist0[i] = dist.min()

In [143]:
raw_dist0.sort()
raw_dist0[:10]


Out[143]:
array([ 0.00001456,  0.00001456,  0.00009051,  0.00009051,  0.00015495,
        0.00015495,  0.00038965,  0.00038965,  0.00064853,  0.00064853])

In [144]:
cdist(saxs_data[4].reshape([1, -1]), saxs_data[ind!=4]).min()


Out[144]:
0.0039112840329649104

In [145]:
raw_k_dist-raw_dist0


Out[145]:
array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.])

In [146]:
k_plot = bokeh.plotting.figure(y_axis_label='distance')

k_plot.line(ind, raw_k_dist, color=palette[0], 
            line_dash=[4,2], legend='pdist')
k_plot.line(ind, raw_dist0, color=palette[1], 
            line_dash=[8,4], legend='cdist')

k_plot.legend.location = "top_left"

bokeh.plotting.show(k_plot)



In [148]:
def smooth_diff_n5(x, y):
    # smooth, noise-robust first derivative on a 5-point central stencil
    h_all = x[1:] - x[:-1]
    h = h_all[0]
    assert np.allclose(h_all, h), 'inconsistent dx'
    dy = (2 * (y[3:-1] - y[1:-3]) + y[4:] - y[:-4]) / (8 * h)

    dydx = np.vstack([x[2:-2], dy]).T  # pair each derivative with its x value
    return dydx

In [149]:
def smooth_diff_n7(x, y):
    # smooth, noise-robust first derivative on a 7-point central stencil
    h_all = x[1:] - x[:-1]
    h = h_all[0]
    assert np.allclose(h_all, h), 'inconsistent dx'

    dy = (5 * (y[4:-2] - y[2:-4]) + 4 * (y[5:-1] - y[1:-5]) + y[6:] - y[:-6]) / (32 * h)

    dydx = np.vstack([x[3:-3], dy]).T  # pair each derivative with its x value
    return dydx

In [153]:
ind


Out[153]:
array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
        13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
        26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
        39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
        52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,
        65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,
        78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,
        91,  92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103,
       104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
       117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,
       130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142,
       143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155,
       156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168,
       169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181,
       182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194,
       195, 196, 197, 198, 199])

In [165]:
raw_k_dist


Out[165]:
array([ 0.00001456,  0.00001456,  0.00009051,  0.00009051,  0.00015495,
        0.00015495,  0.00038965,  0.00038965,  0.00064853,  0.00064853,
        0.00065081,  0.00065081,  0.00067531,  0.00067531,  0.00068687,
        0.00068687,  0.00071902,  0.00071902,  0.00073744,  0.00073744,
        0.00082344,  0.00082344,  0.00085574,  0.00085574,  0.00086444,
        0.00086444,  0.0009241 ,  0.0009241 ,  0.00103314,  0.00103314,
        0.00104904,  0.00104904,  0.00117866,  0.00117866,  0.00119248,
        0.00122225,  0.00127235,  0.00127235,  0.00127382,  0.00127382,
        0.00135341,  0.00135341,  0.00137713,  0.00141955,  0.00141955,
        0.0014716 ,  0.00147581,  0.00147581,  0.00150385,  0.00160462,
        0.00169286,  0.00169286,  0.00188305,  0.00188305,  0.00190914,
        0.00190914,  0.00191853,  0.00204316,  0.00220519,  0.00220519,
        0.00221297,  0.00221297,  0.0023853 ,  0.00252364,  0.00257087,
        0.00259429,  0.00263096,  0.00265477,  0.00265477,  0.00269574,
        0.00269574,  0.00274866,  0.00274866,  0.00280041,  0.00282833,
        0.00283103,  0.00283103,  0.00285891,  0.00285891,  0.00288584,
        0.00296025,  0.00296025,  0.0030044 ,  0.00303831,  0.00306513,
        0.00313283,  0.00313283,  0.00318331,  0.00318331,  0.00319813,
        0.0032523 ,  0.0033773 ,  0.0033773 ,  0.00345248,  0.00348461,
        0.00348461,  0.00349861,  0.00349861,  0.00359851,  0.00364449,
        0.00364449,  0.00366362,  0.00366362,  0.00367506,  0.00372543,
        0.0037328 ,  0.00375183,  0.00375183,  0.00377693,  0.00377693,
        0.00378119,  0.00378119,  0.00379734,  0.00388158,  0.00388158,
        0.00391128,  0.00391128,  0.00392403,  0.00393383,  0.00394925,
        0.00399038,  0.00399038,  0.00405248,  0.00414293,  0.00415781,
        0.00429495,  0.00434333,  0.00434333,  0.00436079,  0.00436079,
        0.00442148,  0.00443727,  0.00443727,  0.00447021,  0.00453043,
        0.0045865 ,  0.00461872,  0.00470169,  0.00477683,  0.00478941,
        0.00478941,  0.0048848 ,  0.00489099,  0.00489661,  0.00491209,
        0.00494634,  0.00500364,  0.00500364,  0.00502461,  0.00504536,
        0.00504536,  0.00505301,  0.00507093,  0.00508292,  0.00508292,
        0.00511697,  0.00529451,  0.0054369 ,  0.0054369 ,  0.00557138,
        0.00561624,  0.00572227,  0.00584073,  0.0059103 ,  0.00599123,
        0.00621831,  0.00621894,  0.0063196 ,  0.00633555,  0.00644566,
        0.00654345,  0.00655973,  0.00655973,  0.00667972,  0.00675611,
        0.00686773,  0.00693923,  0.00711057,  0.00721525,  0.00721525,
        0.00724698,  0.00724698,  0.00742144,  0.00808267,  0.0082362 ,
        0.0084389 ,  0.00865306,  0.00865306,  0.00889643,  0.00948668,
        0.00950688,  0.00971971,  0.01017545,  0.01043254,  0.01155862,
        0.01228613,  0.01228613,  0.0125944 ,  0.01294705,  0.01950195])

In [170]:
i2


Out[170]:
array([  0,   2,   4,   6,   8,  10,  12,  14,  16,  18,  20,  22,  24,
        26,  28,  30,  32,  34,  36,  38,  40,  42,  44,  46,  48,  50,
        52,  54,  56,  58,  60,  62,  64,  66,  68,  70,  72,  74,  76,
        78,  80,  82,  84,  86,  88,  90,  92,  94,  96,  98, 100, 102,
       104, 106, 108, 110, 112, 114, 116, 118, 120, 122, 124, 126, 128,
       130, 132, 134, 136, 138, 140, 142, 144, 146, 148, 150, 152, 154,
       156, 158, 160, 162, 164, 166, 168, 170, 172, 174, 176, 178, 180,
       182, 184, 186, 188, 190, 192, 194, 196, 198])

In [171]:
ind[::2]


Out[171]:
array([  0,   2,   4,   6,   8,  10,  12,  14,  16,  18,  20,  22,  24,
        26,  28,  30,  32,  34,  36,  38,  40,  42,  44,  46,  48,  50,
        52,  54,  56,  58,  60,  62,  64,  66,  68,  70,  72,  74,  76,
        78,  80,  82,  84,  86,  88,  90,  92,  94,  96,  98, 100, 102,
       104, 106, 108, 110, 112, 114, 116, 118, 120, 122, 124, 126, 128,
       130, 132, 134, 136, 138, 140, 142, 144, 146, 148, 150, 152, 154,
       156, 158, 160, 162, 164, 166, 168, 170, 172, 174, 176, 178, 180,
       182, 184, 186, 188, 190, 192, 194, 196, 198])

In [ ]:
tip_to_tail = raw_k_dist[-1] - raw_k_dist[0]  # rise of the sorted k-dist curve; see the sketch below
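
The chord-based ("tip-to-tail") heuristic this cell starts on can be finished as follows: normalize both axes of the sorted $k$-dist curve, draw the chord from its first to its last point, and take the elbow as the point farthest, perpendicularly, from that chord. A sketch, assuming ind and raw_k_dist from above:

In [ ]:
# Sketch of the tip-to-tail elbow heuristic for choosing eps.
xy = np.column_stack([ind / ind.max(), raw_k_dist / raw_k_dist.max()])  # scale-free axes
chord = xy[-1] - xy[0]
chord = chord / np.linalg.norm(chord)  # unit vector from first to last point
rel = xy - xy[0]
perp = np.abs(rel[:, 0] * chord[1] - rel[:, 1] * chord[0])  # |2-D cross product|
i_elbow = perp.argmax()
print('elbow index: {}, candidate eps: {}'.format(i_elbow, raw_k_dist[i_elbow]))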

In [175]:
k_plot = bokeh.plotting.figure(y_axis_label='distance')

# k_plot.line(i, k_dist/k_dist.max(), color=palette[0], legend='scaled')
k_plot.line(ind[::2], raw_k_dist[::2]/raw_k_dist.max(), color=palette[0], 
            line_dash=[4,2], legend='raw')

n = 1
i2 = ind[::2]
exponents = np.polyfit(i2[:75], raw_k_dist[:150:2]/raw_k_dist.max(), n)
fit = np.poly1d(exponents)
y = fit(ind[::2])
k_plot.line(i2, y, color=palette[1], legend='fit ({})'.format(n))

d1_norm = smooth_diff_n5(ind[::2], raw_k_dist[::2]/raw_k_dist.max())
# k_plot.line(d1_norm[:, 0], d1_norm[:, 1], color=palette[2], line_dash=[4,2], legend='d1')

d1_fit = np.empty([n_samples//2-1, 2])
d1_fit[:, 0] = i2[:-1]
d1_fit[:, 1] = np.diff(y)/np.diff(i2)
k_plot.line(d1_fit[:, 0], d1_fit[:, 1], color=palette[3], legend='d1 fit')

d2 = smooth_diff_n5(d1_norm[:, 0], d1_norm[:, 1])
k_plot.line(d2[:, 0], d2[:, 1], color=palette[4], line_dash=[4,2], legend='d2')

d2_fit = np.empty([n_samples//2-2, 2])
d2_fit[:, 0] = i2[1:-1]
d2_fit[:, 1] = np.diff(d1_fit[:, 1])/np.diff(d1_fit[:, 0])
k_plot.line(d2_fit[:, 0], d2_fit[:, 1], color=palette[5], legend='d2 fit')

d3 = smooth_diff_n5(d2[:, 0], d2[:, 1])
k_plot.line(d3[:, 0], d3[:, 1], color=palette[6], line_dash=[4,2], legend='d3')

d3_fit = smooth_diff_n5(d2_fit[:, 0], d2_fit[:, 1])
k_plot.line(d3_fit[:, 0], d3_fit[:, 1], color=palette[1], legend='d3 fit')

#k_plot.line(i[1:-1], d2_dist, color=palette[0], line_dash=[5,2], legend='scaled d2')
# k_plot.line(i[1:-1], d2_k_dist, color=palette[2], line_dash=[10,1], legend='scaled np d2')

# k_plot.line(i[1:], d_raw_k_dist, color=palette[3], line_dash=[10,1], legend='raw d')
# k_plot.line(i[1:-1], d2_raw_dist, color=palette[1], line_dash=[5,2], legend='raw d2')
              
k_plot.legend.location = "top_left"

bokeh.plotting.show(k_plot)



In [151]:
y


Out[151]:
0.92656301440735367

Find the elbow, or point of maximum curvature


In [231]:
x = np.arange(10)

In [238]:
xx = np.vstack([x,x]).T

In [249]:
d1_norm = smooth_diff_n5(ind, raw_k_dist/raw_k_dist.max())

In [253]:
d1_raw = smooth_diff_n5(ind, raw_k_dist)

In [273]:
x = np.linspace(0, 2*np.pi)
y = np.sin(x)
dydx = smooth_diff_n5(x, y)
d2ydx2 = smooth_diff_n5(dydx[:, 0], dydx[:, 1])

p = bokeh.plotting.figure()
p.line(x, y, legend='sin(x)', color=palette[0])
p.line(dydx[:, 0], dydx[:, 1], legend='cos(x)', color=palette[1])
p.line(d2ydx2[:, 0], d2ydx2[:, 1], legend='-sin(x)', color=palette[2])

bokeh.plotting.show(p)



In [303]:
# Compute DBSCAN
## Tune these parameters to adjust cluster size ##
distance = 1  # <-- needs tuning; see the k-dist analysis above
min_samples = 2
##################################################
x_db = DBSCAN(eps=distance, min_samples=min_samples).fit(scaled_saxs)
x_core_samples_mask = np.zeros_like(x_db.labels_, dtype=bool)
x_core_samples_mask[x_db.core_sample_indices_] = True
saxs_labels = x_db.labels_ + 1  # shift so noise (-1) becomes 0, i.e. independent structures
x_clusters_ = len(set(saxs_labels)) - (1 if 0 in saxs_labels else 0)

n_db = DBSCAN(eps=distance, min_samples=min_samples).fit(scaled_sans)
n_core_samples_mask = np.zeros_like(n_db.labels_, dtype=bool)
n_core_samples_mask[n_db.core_sample_indices_] = True
sans_labels = n_db.labels_ + 1  # shift so noise (-1) becomes 0, i.e. independent structures
n_clusters_ = len(set(sans_labels)) - (1 if 0 in sans_labels else 0)

p_db = DBSCAN(eps=distance, min_samples=min_samples).fit(scaled_pr)
p_core_samples_mask = np.zeros_like(p_db.labels_, dtype=bool)
p_core_samples_mask[p_db.core_sample_indices_] = True
pr_labels = p_db.labels_ + 1  # shift so noise (-1) becomes 0, i.e. independent structures
p_clusters_ = len(set(pr_labels)) - (1 if 0 in pr_labels else 0)

In [267]:
# x-ray clusters
x_unique = set(saxs_labels)
x_unique.remove(0)
print('cluster labels: {}'.format(x_unique))
print('unique clusters: {}'.format(len(x_unique) + list(saxs_labels).count(0)))
for c in set(saxs_labels):
    print('{}: {}'.format(c, list(saxs_labels).count(c)))


cluster labels: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39}
unique clusters: 86
0: 47
1: 4
2: 2
3: 2
4: 6
5: 2
6: 2
7: 11
8: 6
9: 2
10: 5
11: 2
12: 2
13: 2
14: 2
15: 2
16: 2
17: 2
18: 2
19: 5
20: 5
21: 18
22: 2
23: 2
24: 2
25: 2
26: 2
27: 2
28: 18
29: 4
30: 2
31: 2
32: 3
33: 2
34: 11
35: 2
36: 2
37: 2
38: 3
39: 4

In [268]:
# neutron clusters
unique = set(sans_labels)
unique.remove(0)
total_clusters = len(unique) + list(sans_labels).count(0)
print('cluster labels: {}'.format(unique))
print('unique clusters: {}'.format(total_clusters))
for c in set(sans_labels):
    print('{}: {}'.format(c, list(sans_labels).count(c)))


cluster labels: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39}
unique clusters: 86
0: 47
1: 4
2: 2
3: 2
4: 6
5: 2
6: 2
7: 11
8: 6
9: 2
10: 5
11: 2
12: 2
13: 2
14: 2
15: 2
16: 2
17: 2
18: 2
19: 5
20: 5
21: 18
22: 2
23: 2
24: 2
25: 2
26: 2
27: 2
28: 18
29: 4
30: 2
31: 2
32: 3
33: 2
34: 11
35: 2
36: 2
37: 2
38: 3
39: 4

In [269]:
# pr clusters
unique = set(pr_labels)
unique.remove(0)
total_clusters = len(unique) + list(pr_labels).count(0)
print('cluster labels: {}'.format(unique))
print('unique clusters: {}'.format(total_clusters))
for c in set(pr_labels):
    print('{}: {}'.format(c, list(pr_labels).count(c)))


cluster labels: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23}
unique clusters: 166
0: 143
1: 2
2: 2
3: 2
4: 3
5: 2
6: 2
7: 2
8: 2
9: 2
10: 2
11: 5
12: 2
13: 2
14: 4
15: 2
16: 2
17: 2
18: 3
19: 5
20: 3
21: 2
22: 2
23: 2

In [ ]:
np.savetxt('saxs_clusters.txt', saxs_labels, fmt='%d')
np.savetxt('sans_clusters.txt', sans_labels, fmt='%d')
np.savetxt('pr_clusters.txt', pr_labels, fmt='%d')

In [ ]:
print(sans_labels)
print(sans_labels.shape)
slabels = np.array(sans_labels, dtype='str')
# print(slabels)
# print(slabels.shape)

In [ ]:
from matplotlib import offsetbox
i_compare = 0

mn = scaled_saxs.min(axis=0)
mx = scaled_saxs.max(axis=0)

# for i0 in range(1):
for i0 in range(18):
    plt.figure()
    
    # plot points to make the correct box size
    plt.plot(mn[i0], mn[i_compare], 'w.')
    plt.plot(mx[i0], mx[i_compare], 'w.')
    
    for j in range(len(scaled_saxs)):
        if slabels[j] != '0':
            plt.text(scaled_saxs[j, i0], scaled_saxs[j, i_compare], slabels[j],
                     fontdict={'weight': 'bold', 'size': 15}, 
                     color='r') # plt.cm.Set1(labels[i]/10.0))
        else:
            plt.plot(scaled_saxs[j, i0], scaled_saxs[j, i_compare], 'k.',
                    markersize=5)
                
    plt.xlabel(r'$Q_{}$'.format(i0))
    plt.ylabel(r'$Q_{}$'.format(i_compare))

In [298]:
import hdbscan

In [326]:
# Compute HDBSCAN
## Tune these parameters to adjust cluster size ##
min_samples = 2
##################################################

x_db = hdbscan.HDBSCAN(min_cluster_size=min_samples).fit_predict(scaled_saxs)

In [323]:
x_db


Out[323]:
HDBSCAN(algorithm='best', allow_single_cluster=False, alpha=1.0,
    approx_min_span_tree=True, core_dist_n_jobs=4, gen_min_span_tree=False,
    leaf_size=40, match_reference_implementation=False,
    memory=Memory(cachedir=None), metric='euclidean', min_cluster_size=2,
    min_samples=None, p=None)

In [327]:
saxs_labels = x_db + 1  # shift so noise (-1) becomes 0, i.e. independent structures
x_clusters_ = len(set(saxs_labels)) - (1 if 0 in saxs_labels else 0)

n_db = hdbscan.HDBSCAN(min_cluster_size=min_samples).fit_predict(scaled_sans)
sans_labels = n_db + 1  # shift so noise (-1) becomes 0, i.e. independent structures
n_clusters_ = len(set(sans_labels)) - (1 if 0 in sans_labels else 0)

p_db = hdbscan.HDBSCAN(min_cluster_size=min_samples).fit_predict(scaled_pr)
pr_labels = p_db + 1  # shift so noise (-1) becomes 0, i.e. independent structures
p_clusters_ = len(set(pr_labels)) - (1 if 0 in pr_labels else 0)

In [328]:
# x-ray clusters
x_unique = set(saxs_labels)
x_unique.remove(0)
print('cluster labels: {}'.format(x_unique))
print('unique clusters: {}'.format(len(x_unique) + list(saxs_labels).count(0)))
for c in set(saxs_labels):
    print('{}: {}'.format(c, list(saxs_labels).count(c)))


cluster labels: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32}
unique clusters: 96
0: 64
1: 10
2: 3
3: 4
4: 3
5: 4
6: 3
7: 5
8: 4
9: 5
10: 4
11: 4
12: 2
13: 6
14: 2
15: 5
16: 3
17: 2
18: 2
19: 3
20: 5
21: 8
22: 6
23: 7
24: 3
25: 5
26: 4
27: 4
28: 3
29: 5
30: 3
31: 5
32: 4

In [329]:
# neutron clusters
unique = set(sans_labels)
unique.remove(0)
total_clusters = len(unique) + list(sans_labels).count(0)
print('cluster labels: {}'.format(unique))
print('unique clusters: {}'.format(total_clusters))
for c in set(sans_labels):
    print('{}: {}'.format(c, list(sans_labels).count(c)))


cluster labels: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31}
unique clusters: 88
0: 57
1: 10
2: 2
3: 3
4: 4
5: 4
6: 5
7: 5
8: 5
9: 3
10: 5
11: 4
12: 4
13: 4
14: 3
15: 6
16: 6
17: 2
18: 7
19: 5
20: 5
21: 5
22: 4
23: 2
24: 5
25: 8
26: 3
27: 3
28: 5
29: 7
30: 5
31: 4

In [330]:
# pr clusters
unique = set(pr_labels)
unique.remove(0)
total_clusters = len(unique) + list(pr_labels).count(0)
print('cluster labels: {}'.format(unique))
print('unique clusters: {}'.format(total_clusters))
for c in set(pr_labels):
    print('{}: {}'.format(c, list(pr_labels).count(c)))


cluster labels: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57}
unique clusters: 85
0: 28
1: 2
2: 6
3: 4
4: 3
5: 2
6: 4
7: 5
8: 3
9: 2
10: 3
11: 4
12: 3
13: 2
14: 2
15: 2
16: 2
17: 2
18: 2
19: 4
20: 3
21: 4
22: 3
23: 2
24: 2
25: 2
26: 3
27: 6
28: 2
29: 4
30: 3
31: 2
32: 2
33: 2
34: 2
35: 2
36: 2
37: 3
38: 2
39: 2
40: 2
41: 3
42: 7
43: 3
44: 2
45: 3
46: 6
47: 5
48: 4
49: 6
50: 3
51: 2
52: 3
53: 3
54: 2
55: 3
56: 3
57: 2

In [331]:
np.savetxt('saxs_clusters.txt', saxs_labels, fmt='%d')
np.savetxt('sans_clusters.txt', sans_labels, fmt='%d')
np.savetxt('pr_clusters.txt', pr_labels, fmt='%d')

In [332]:
print(sans_labels)
print(sans_labels.shape)
slabels = np.array(sans_labels, dtype='str')
# print(slabels)
# print(slabels.shape)


[11 11 11 11  0 17 14 17  0 14 14 15  0 15  0  0 15 15 15 15  1  1  1  1  1
  1 13  0 13 13 13  0 22 22 18 18 18 18 18 18  0  0  0  0 20 21 21 21  0 21
 20  0  0  0 22 20 20 20 21  2  0  2 16  0 12 16 16  0 19 16  1  1  1  1 22
  0  0  0  0  0 12  0  0  7  7  7  7  7  8  8  8  0 16 16 19 19 19 19  0  0
  0 18  0  0 25 25 25 25 12 12 25 25 24 24 24 24  0 24  0 25 25  0  0  0  0
  0  0 29 29 30 31 29 29 31 31 30 30 30 30 10 10  5 10 10 10  5  5  0  5  0
  0  0 23  0 27 27 28 28 28 26 26 26 28 27 28 23  0  0  0  0  0 29 31 29  0
  0  9  9  9  0  8  8  0  6  6  6  6  6  0  3  3  4  4  3  4  4 29  0  0  0]
(200,)

In [333]:
from matplotlib import offsetbox
i_compare = 0

mn = scaled_saxs.min(axis=0)
mx = scaled_saxs.max(axis=0)

# for i0 in range(1):
for i0 in range(18):
    plt.figure()
    
    # plot points to make the correct box size
    plt.plot(mn[i0], mn[i_compare], 'w.')
    plt.plot(mx[i0], mx[i_compare], 'w.')
    
    for j in range(len(scaled_saxs)):
        if slabels[j] != '0':
            plt.text(scaled_saxs[j, i0], scaled_saxs[j, i_compare], slabels[j],
                     fontdict={'weight': 'bold', 'size': 15}, 
                     color='r') # plt.cm.Set1(labels[i]/10.0))
        else:
            plt.plot(scaled_saxs[j, i0], scaled_saxs[j, i_compare], 'k.',
                    markersize=5)
                
    plt.xlabel(r'$Q_{}$'.format(i0))
    plt.ylabel(r'$Q_{}$'.format(i_compare))


Write DCD output


In [ ]:
import sasmol.sasmol as sasmol

dcd_fname = glob.glob('*.dcd')
assert len(dcd_fname) == 1, 'ERROR: unsure which dcd file to use: {}'.format(dcd_fname)
dcd_fname = dcd_fname[0]

pdb_fname = glob.glob('*.pdb')
assert len(pdb_fname) == 1, 'ERROR: unsure which pdb file to use: {}'.format(pdb_fname)
pdb_fname = pdb_fname[0]

In [ ]:
mol = sasmol.SasMol(0)
mol.read_pdb(pdb_fname)

In [ ]:
if not np.alltrue(sans_labels == saxs_labels):
    print('WARNING: labels do not match\nusing neutron labels')
labels = sans_labels

In [ ]:
dcd_fname

In [ ]:
# create a dcd for every cluster with >1 frame
dcd_fnames = []
cluster_out_files = [] # dcds for clusters
unique_out_fname = '{}_unique.dcd'.format(dcd_fname[:-4]) 
dcd_out_file = mol.open_dcd_write(unique_out_fname) # dcd file for unique structures

dcd_in_file = mol.open_dcd_read(dcd_fname)

unique = set(labels) - {0}  # cluster labels; 0 marks independent structures
for i in range(len(unique)):
    dcd_fnames.append('{}_c{:02d}.dcd'.format(dcd_fname[:-4], i))
    cluster_out_files.append(mol.open_dcd_write(dcd_fnames[i]))

visited_cluster = set()
dcd_out_frame = 0
cluster_out_frame = np.zeros(len(unique), dtype=int)

for (i, label) in enumerate(labels):
    mol.read_dcd_step(dcd_in_file, i)
    if label == 0:
        dcd_out_frame += 1
        mol.write_dcd_step(dcd_out_file, 0, dcd_out_frame)
    else:
        cluster_out_frame[label-1] += 1
        # print('adding frame to cluster {}'.format(label-1))
        # print(cluster_out_frame)
        mol.write_dcd_step(cluster_out_files[label-1], 0, cluster_out_frame[label-1])
        if label not in visited_cluster:
            visited_cluster.add(label)
            dcd_out_frame += 1
            mol.write_dcd_step(dcd_out_file, 0, dcd_out_frame)
        
for cluster_out_file in cluster_out_files:
    mol.close_dcd_write(cluster_out_file)

mol.close_dcd_write(dcd_out_file)    
mol.close_dcd_read(dcd_in_file[0])

PCA Analysis


In [ ]:
from sklearn.decomposition import PCA
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

In [ ]:
pca_orig = PCA()
pca_orig.fit(saxs_data)

In [ ]:
pca_scaled = PCA()
pca_scaled.fit(scaled_saxs)

In [ ]:
print(pca_orig.explained_variance_ratio_)
print(pca_scaled.explained_variance_ratio_)

In [ ]:
plt.figure()
plt.plot(q_saxs, pca_orig.explained_variance_ratio_, 'o', label='unscaled')
plt.plot(q_saxs, pca_scaled.explained_variance_ratio_, 's', label='scaled')
plt.legend()
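
As a follow-up sketch, the cumulative explained variance gives the number of components needed to reach a target such as 99%.

In [ ]:
# Sketch: cumulative explained variance for the scaled SAXS data.
cum_var = np.cumsum(pca_scaled.explained_variance_ratio_)
n_99 = int(np.searchsorted(cum_var, 0.99)) + 1
print('{} components explain 99% of the variance'.format(n_99))

plt.figure()
plt.plot(np.arange(1, len(cum_var) + 1), cum_var, 'o-')
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')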

In [ ]:
from sklearn.datasets.samples_generator import make_blobs

In [ ]:
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4,
                            random_state=0)

In [ ]:
X_scaled = StandardScaler().fit_transform(X)

In [ ]:
X_range = X.max(axis=0) - X.min(axis=0)
print(X_range)

In [ ]:
X_scaled_range = X_scaled.max(axis=0) - X_scaled.min(axis=0)
print(X_scaled_range)

In [ ]:
X_s2 = StandardScaler().fit_transform(X)

In [ ]:
X_s2_range = X_s2.max(axis=0) - X_s2.min(axis=0)
print(X_s2_range)
